
test(kad): use virtual time to resolve flaky bootstrap test#6361

Open
0xsamalt wants to merge 7 commits into libp2p:master from 0xsamalt:fix-flaky-bootstrap-test

Conversation

@0xsamalt

Description

This PR fixes intermittent CPU-lag flakiness in Kademlia's bootstrap.rs test suite.

The tests previously measured real wall-clock timeouts using futures_timer::Delay and web_time::Instant, so they failed whenever a loaded CI runner overshot the expected timings.

By leveraging Tokio's test-util feature and wrapping the timers in test-only mocks, #[tokio::test(start_paused = true)] makes the elapsed test time fully deterministic and decoupled from wall-clock performance. The tests now run instantly and passed 100/100 iterations when looped locally.

cc @jxs @elenaf9

Notes & open questions

  • This is purely an internal test infrastructure fix. Production paths are untouched because futures_timer::Delay is preserved under #[cfg(not(test))].
  • Because this does not affect any public API or library behavior, a changelog entry was intentionally omitted.
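As a sketch of the cfg-gating pattern mentioned above (names here are hypothetical; the real change lives in the kad timer plumbing), the production and test timers could be split like this, so production builds keep futures_timer::Delay untouched:

```rust
use std::time::Duration;

// Production builds keep the real futures_timer::Delay.
#[cfg(not(test))]
fn delay(duration: Duration) -> futures_timer::Delay {
    futures_timer::Delay::new(duration)
}

// Test builds swap in tokio's sleep, which respects the paused virtual
// clock under #[tokio::test(start_paused = true)].
#[cfg(test)]
async fn delay(duration: Duration) {
    tokio::time::sleep(duration).await;
}
```

Call sites simply write `delay(d).await`, so both configurations present the same interface.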

Change checklist

  • I have performed a self-review of my own code
  • I have made corresponding changes to the documentation (N/A - Test fix)
  • I have added tests that prove my fix is effective or that my feature works (Modified existing flaky tests)
  • A changelog entry has been made in the appropriate crates (N/A - Internal testing change only)

@mergify
Contributor

mergify Bot commented Mar 31, 2026

This pull request has merge conflicts. Could you please resolve them @0xsamalt? 🙏

@elenaf9
Member

elenaf9 commented Apr 21, 2026

Thank you for this PR @0xsamalt!

Deterministic timings in the test scenario would be great. Still, I don't think we should switch the Delay implementation between test and non-test configuration. In the end we want to test the code that is run in production, which uses futures_timer::Delay.

I haven't seen the flaky kad test in the wild for a while; did you recently run into issues with it?

@0xsamalt
Author

Hey @elenaf9,

That makes complete sense! I agree that testing the exact production dependencies is generally much safer.

To answer your question: yes, I did recently run into this flakiness locally. My CPU had a tiny lag spike during the test runner, so a timer that was supposed to hit the assertion at < 10ms overshot to 12ms and panicked.

I attempted to introduce Tokio's start_paused virtual time here under the assumption that Kademlia was trying to move aggressively towards deterministic test execution, but I completely understand wanting to keep futures_timer::Delay intact across all environments!

Since the flakiness only seems to pop up on heavily loaded/laggy local machines, how would you like to proceed?

  1. Option A: I can happily close this PR so we don't add unnecessary mock objects just to appease a rare edge case.
  2. Option B: If you'd like to protect against the lag edge case without swapping the timer, I can rip out the virtual-time code in this PR and simply widen the time assertion bounds (e.g. adding ~10ms of lag allowance) on those specific lines so they don't panic on slow runners.
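For concreteness, Option B would amount to something like the following (values hypothetical, not the actual bootstrap.rs assertion):

```rust
use std::time::{Duration, Instant};

fn main() {
    let start = Instant::now();
    // ... the operation under test would run here ...
    let elapsed = start.elapsed();

    // The original tight bound, assert!(elapsed < EXPECTED), panics on a
    // 12ms overshoot. Widening the bound tolerates scheduler lag on
    // loaded runners while still catching a genuinely stuck timer.
    const EXPECTED: Duration = Duration::from_millis(10);
    const LAG_ALLOWANCE: Duration = Duration::from_millis(10);

    assert!(
        elapsed < EXPECTED + LAG_ALLOWANCE,
        "took {elapsed:?}, expected under {:?}",
        EXPECTED + LAG_ALLOWANCE
    );
}
```

The trade-off is that a widened bound still depends on wall-clock behavior, so it reduces rather than eliminates the flake.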

Let me know what you prefer!
